Chapter 5 - New Developments: Topic Modeling with BERTopic!#

2022 July 30


What is BERTopic?#

  • As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”

    • Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”

      • For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” have no tokens in common, yet they express a similar idea.

      • If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.

  • This is where BERTopic comes in! BERTopic is a topic modeling technique that combines transformer-based document embeddings (the technology behind BERT) with other ML tools, namely dimensionality reduction, clustering, and a class-based TF-IDF weighting, to provide a flexible and powerful topic modeling module (with great visualization support as well!)

  • In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
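To make the earlier point about words versus topics concrete, the short sketch below measures the token overlap between the two example sentences. The `tokenize` helper is a simplified stand-in written for this illustration, not part of BERTopic.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

sentence_a = "I enjoy learning to code."
sentence_b = "Educating myself on new computer programming techniques makes me happy!"

tokens_a = tokenize(sentence_a)
tokens_b = tokenize(sentence_b)

# Jaccard overlap: shared tokens divided by all distinct tokens
shared = tokens_a & tokens_b
jaccard = len(shared) / len(tokens_a | tokens_b)

print("Shared tokens:", shared)
print("Jaccard overlap:", jaccard)
```

A purely word-based comparison sees these sentences as unrelated, which is why an embedding-based approach is attractive for topic modeling.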

Required installs:#

# Installs the base bertopic module:
!pip install bertopic

# Other transformers/language backends may require additional installs.
# Quote the package name so that shells such as zsh do not treat the square brackets as a glob pattern:
!pip install "bertopic[flair]" # can substitute 'flair' with 'gensim', 'spacy', 'use'

# bertopic also comes with its own handy visualization suite:
!pip install "bertopic[visualization]"

Data sourcing#

  • For this exercise, we’re going to use a popular data set, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:

import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

print(documents[0]) # Any ice hockey fans? 
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Creating a BERTopic model:#

  • Using the BERTopic module requires you to create an instance of the model. When doing so, you can specify a number of parameters, including:

    • language -> the language of your documents

    • min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics

    • embedding_model -> the model used to embed your documents; many are supported!

Example instantiation:#

from sklearn.feature_extraction.text import CountVectorizer 

# example parameter: a custom vectorizer model can be used to remove stopwords from the documents: 
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english') 

# instantiating the model: 
model = BERTopic(vectorizer_model = stopwords_vectorizer)
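To give a feel for what a vectorizer configured with ngram_range=(1, 2) and stop-word removal extracts, here is a rough standard-library approximation. It is only a sketch: `STOP_WORDS` is a tiny hypothetical list (scikit-learn's built-in 'english' list is much longer), and `unigrams_and_bigrams` is a helper invented for this illustration, not scikit-learn's implementation.

```python
import re

# Tiny hypothetical stop-word list, for illustration only
STOP_WORDS = {"the", "a", "is", "for", "on", "of", "and"}

def unigrams_and_bigrams(text):
    """Return single words and adjacent word pairs after dropping stop words."""
    # Keep lowercase tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Bigrams are built from the tokens that remain after stop-word removal
    bigrams = [" ".join(pair) for pair in zip(kept, kept[1:])]
    return kept + bigrams

terms = unigrams_and_bigrams("The Sega Genesis is for sale")
print(terms)
```

Because stop words are dropped before word pairs are formed, a pair such as "genesis sale" can join words that were not adjacent in the original text. Allowing two-word terms is also what lets multi-word phrases appear in topic descriptions.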

Fitting the model:#

  • The first step of topic modeling is to fit the model to the documents:

topics, probs = model.fit_transform(documents)
  • .fit_transform() returns two outputs:

    • topics contains mappings of inputs (documents) to their modeled topic (alternatively, cluster)

    • probs contains, for each input, the probability that it belongs to its assigned topic

  • Note: fit_transform() can be substituted with fit(), which fits the model in the same way but returns the fitted model itself instead of topics and probs. To assign topics to new, unseen documents, call transform() on the fitted model.
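To make the two return values concrete, the sketch below groups documents by their assigned topic. All three lists are made-up stand-ins (the `toy_` names are hypothetical) shaped like the inputs and outputs described above, so the snippet runs without fitting a model.

```python
from collections import defaultdict

# Hypothetical stand-ins for documents and the outputs of fit_transform()
toy_documents = ["pens beat devils", "new encryption chip", "playoff hockey tonight", "misc note"]
toy_topics = [0, 1, 0, -1]          # one topic number per document
toy_probs = [0.92, 0.81, 0.67, 0.0]  # one probability per document

# Collect (document, probability) pairs under each topic number
docs_by_topic = defaultdict(list)
for doc, topic, prob in zip(toy_documents, toy_topics, toy_probs):
    docs_by_topic[topic].append((doc, prob))

for topic in sorted(docs_by_topic):
    print(topic, docs_by_topic[topic])
```

The same loop should work on the real `documents`, `topics`, and `probs` as long as they are aligned one-to-one, which is the assumption made here.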

Viewing topic modeling results:#

  • The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:

# view your topics: 
topics_info = model.get_topic_info()

# get detailed information about the top five most common topics: 
print(topics_info.head(5))
   Topic  Count                                       Name
0     -1   6646                     -1_file_use_need_using
1      0   1838                0_team_games_players_season
2      1    616              1_clipper_encryption_chip_nsa
3      2    527  2_cheek ken_ken huh_ignore art_huh ignore
4      3    452          3_israel_israeli_jews_palestinian
  • When examining topic information, you may see a topic with the assigned number ‘-1.’ Topic -1 refers to all input outliers which do not have a topic assigned and should typically be ignored during analysis.

  • Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
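As a small illustration of setting Topic -1 aside, the sketch below counts topic assignments, reports the share of outliers, and ranks only the real topics. The `toy_topics` list is invented for this example; in practice you would use the `topics` list returned when fitting the model.

```python
from collections import Counter

# Hypothetical topic assignments; -1 marks outliers
toy_topics = [-1, 0, 0, 1, -1, 0, 2, 1, -1, 0]

counts = Counter(toy_topics)
outlier_share = counts[-1] / len(toy_topics)

# Ignore the outlier bin when ranking topics by size
assigned = Counter(t for t in toy_topics if t != -1)

print(f"Outlier share: {outlier_share:.0%}")
print("Largest topics:", assigned.most_common(2))
```

Checking the outlier share is a quick sanity check: a very large share may suggest revisiting parameters such as min_topic_size.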

# access a single topic: 
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
# get representative documents for a specific topic: 
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics 
["\ni have no idea, nor do i care.  however, i'd like to point out that\nblomberg got the first plate appearance by a designated hitter, and\nthe first walk by a designated hitter.  i am not sure, but i do not\nthink that he also got the first hit by a designated hitter.", ": >\n: >ATLANTIC DIVISION\n: >\t\n: >\tST JOHN'S MAPLE LEAFS VS MONCTON HAWKS\n: >\tMONCTON HAWKS\n: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,\n: >defensive, good goaltending. John Leblanc and Stu Barnes are the only\n: >noticable guns on the team. But the defense is top notch and \n: >Mike O'Neill is the most underrated goalie in the league.\n: >\n\n: Bri, as I have tried to tell you since 2 February, Michael O'Neill\n: might be the most underrated goalie in the AHL, but he ISN'T in the\n: AHL.  He's on the Winnipeg Jets' injury list, as he has been since\n: his first NHL start against the Ottawa Senators.  He's out until\n: next year after surgery to repair a shoulder separation.\n\n: Stu Barnes might be an AHL gun for the Hawks, but he's now the third\n: line center with the Jets, and has been since mid January or so.\n\nSorry, my memory is gone. I thought that O'Neill got sent back\ndown in February but I must have been given incorrect info. I guess\nthis says it all about Moncton because Barnes is still one of\ntheir top 3 or so scorers even though he's been out since January.", "\n\nI didn't see any smilies in this message so.......\n\n                W     T    L    PTs\n   Team A      50    30    4    104\n   Team B      52    32    0    104\n\n\nThere you go.  Two teams that tie in points without identical records.\n\n"]
# find topics similar to a key term/phrase: 
similar_topics, similarity_scores = model.find_topics("sports", top_n=5)
print("Most similar topics: " + str(similar_topics)) # view the numbers of the top-5 most similar topics

# print the top terms of the most similar topics
for topic_num in similar_topics: 
    print('\nContents from topic number: ' + str(topic_num) + '\n')
    print(model.get_topic(topic_num))

Most similar topics: [0, 30, 6, 166, 4]

Contents from topic number: 0

[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]

Contents from topic number: 30

[('games', 0.03260548961663573), ('sega', 0.02366315012814771), ('arcade', 0.012166539858844822), ('snes', 0.010883627526511617), ('sega genesis', 0.01081910740506706), ('joysticks', 0.010294764495945618), ('games sale', 0.010085068481475858), ('sale', 0.00964091677280479), ('joystick', 0.009006639792149954), ('sega cd', 0.0074012373591723)]

Contents from topic number: 6

[('riding', 0.011792240692170709), ('ride', 0.011256591323418531), ('driving', 0.007418204752466058), ('road', 0.007362304673149508), ('traffic', 0.006971330162717447), ('roads', 0.005093305390738552), ('bikes', 0.0046328368271995445), ('bikers', 0.0041220512073587194), ('riders', 0.0037367046265679754), ('passengers', 0.0035386604055364823)]

Contents from topic number: 166

[('religion', 0.024810151190057972), ('war', 0.01958713595572545), ('wars', 0.0141305144151792), ('crusades', 0.012827683749926261), ('history', 0.01202363443416338), ('religious', 0.009458363539211138), ('unbelievers', 0.008338773663764506), ('yoked unbelievers', 0.007970064155940823), ('statement religion', 0.007495172035922859), ('gods', 0.0071255212864334274)]

Contents from topic number: 4

[('health', 0.0072259305085357), ('cancer', 0.005975505039095839), ('disease', 0.00513078203584376), ('tobacco', 0.005069613472607038), ('medical', 0.00492433353954727), ('hiv', 0.004709304265420622), ('malaria', 0.004112010029452724), ('smokeless tobacco', 0.004033769948845448), ('lyme', 0.003923377448522405), ('medical newsletter', 0.003903230753928965)]
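The idea behind a similarity search like the one above can be sketched with plain Python. This assumes, as far as I understand the library, that the search term is embedded and compared with one embedding per topic using cosine similarity. The three-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical topic embeddings keyed by topic number
toy_topic_embeddings = {
    0: [0.9, 0.1, 0.0],
    1: [0.0, 0.2, 0.9],
    2: [0.6, 0.6, 0.1],
}
# Hypothetical embedding of the search term
toy_query_embedding = [1.0, 0.2, 0.0]

scores = {
    topic: cosine_similarity(toy_query_embedding, embedding)
    for topic, embedding in toy_topic_embeddings.items()
}
# Topic numbers ordered from most to least similar
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

This also explains why loosely related topics can show up in the results: the search always returns the closest topics, even when the closest ones are not very close.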

Saving/loading models:#

  • One of the most obvious drawbacks of BERTopic is its run time. But, rather than refitting the model every time you want to conduct topic modeling analysis, you can simply save a fitted model and load it later!

# save your model: 
# model.save("TAML_ex_model")
# load it later: 
# loaded_model = BERTopic.load("TAML_ex_model")
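One way to avoid refitting is a small "load if saved, otherwise fit and save" helper. The sketch below is library-agnostic: `load_or_fit` and the `toy_` functions are hypothetical names written for this illustration, and a pickled dictionary stands in for a fitted model so the example runs quickly.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_fit(path, fit_fn, save_fn, load_fn):
    """Return the saved model if `path` exists; otherwise fit, save, and return a new one."""
    if Path(path).exists():
        return load_fn(path)
    fitted = fit_fn()
    save_fn(fitted, path)
    return fitted

# Toy stand-ins: a dictionary plays the role of a fitted model
fit_calls = []

def toy_fit():
    fit_calls.append(1)  # record each (expensive) fit
    return {"n_topics": 3}

def toy_save(fitted, path):
    with open(path, "wb") as handle:
        pickle.dump(fitted, handle)

def toy_load(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.pkl"
    first = load_or_fit(model_path, toy_fit, toy_save, toy_load)   # fits and saves
    second = load_or_fit(model_path, toy_fit, toy_save, toy_load)  # loads from disk

print("Number of fits:", len(fit_calls))
```

With BERTopic, the same pattern could plausibly be wired up by passing a function that fits the model as `fit_fn`, `lambda m, p: m.save(str(p))` as `save_fn`, and `BERTopic.load` as `load_fn`.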

Visualizing topics:#

  • Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.

  • Depending on the visualization, it can even reveal patterns that would be much harder/impossible to see through textual analysis - like inter-topic distance!

  • Let’s see some examples!

# Create a 2D representation of your modeled topics & their pairwise distances: 
model.visualize_topics()
# Get the words and c-TF-IDF scores of the top topics, but in bar chart form! 
model.visualize_barchart()
# Evaluate topic similarity through a heat map: 
model.visualize_heatmap()

Conclusion#